Prosper Loans Exploration by Gabor Sar

1 Abstract

I am going to analyse the loan data from Prosper. I would like to know what properties can describe a loan and what parameters can affect those properties. Is the amount the most important, or the monthly payment? Is the amount affected by the credit score of the borrower? The key goal of this analysis is to find answers to these questions.

2 Introduction

Prosper (an American peer-to-peer lending marketplace) provides a daily snapshot of its loan data. There are 113937 loans in the dataset with 81 features. As there are a lot of features, I am going to subset the data to those that I am interested in:

  1. loan original amount
  2. monthly loan payment
  3. term
  4. loan origination date
  5. stated monthly income
  6. debt to income ratio
  7. employment status
  8. employment status duration
  9. credit score range lower
  10. credit score range upper
  11. borrower rate
  12. borrower APR
  13. listing category
  14. loan status

Loan original amount, monthly payment and term are the most important features for this analysis. They describe how much people borrow, how much they have to pay back, and how quickly. I would like to know what underlying trends those features have, and what kind of relationships they have with each other and with other features (for example, how they changed over time). I would also like to know who the borrowers are and whether their characteristics have any relation to the first three factors.

Listing category contains integer numbers. In order to make plotting by it easier I am going to convert it into a factor, using the following levels:

Number Level
0 Not Available
1 Debt Consolidation
2 Home Improvement
3 Business
4 Personal Loan
5 Student Use
6 Auto
7 Other
8 Baby&Adoption
9 Boat
10 Cosmetic Procedure
11 Engagement Ring
12 Green Loans
13 Household Expenses
14 Large Purchases
15 Medical/Dental
16 Motorcycle
17 RV
18 Taxes
19 Vacation
20 Wedding Loans

Loan origination date is a factor of strings. Converting it to dates could also make plotting easier.
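The two conversions described above could be sketched in R as follows. The column names match the listing in the next section, but the three-row `raw` data frame and the timestamp format of the raw date strings are assumptions made for illustration, not the original code.

```r
# Hypothetical three-row stand-in for the raw Prosper data (assumed values).
raw <- data.frame(
  ListingCategory = c(0L, 2L, 16L),
  LoanOriginationDate = c("2007-09-12 00:00:00", "2014-03-03 00:00:00",
                          "2007-01-17 00:00:00"),
  stringsAsFactors = TRUE
)

# Labels for the integer codes 0..20, in the order of the table above.
category_levels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

loans <- raw
# Map each integer code to its label; the level order follows the codes.
loans$ListingCategory <- factor(raw$ListingCategory, levels = 0:20,
                                labels = category_levels)
# Keep the first ten characters (YYYY-MM-DD) and parse them as dates.
loans$LoanOriginationDate <- as.Date(
  substr(as.character(raw$LoanOriginationDate), 1, 10)
)
str(loans)
```

Passing `levels = 0:20` explicitly keeps all 21 levels even when some codes do not occur in the data.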

3 Univariate Analysis

##  [1] "LoanOriginalAmount"       "MonthlyLoanPayment"      
##  [3] "Term"                     "LoanOriginationDate"     
##  [5] "StatedMonthlyIncome"      "DebtToIncomeRatio"       
##  [7] "CreditScoreRangeLower"    "CreditScoreRangeUpper"   
##  [9] "BorrowerRate"             "BorrowerAPR"             
## [11] "EmploymentStatus"         "EmploymentStatusDuration"
## [13] "ListingCategory"          "LoanStatus"
## 'data.frame':    113937 obs. of  14 variables:
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment      : num  330 319 123 321 564 ...
##  $ Term                    : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanOriginationDate     : Date, format: "2007-09-12" "2014-03-03" ...
##  $ StatedMonthlyIncome     : num  3083 6125 2083 2875 9583 ...
##  $ DebtToIncomeRatio       : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ CreditScoreRangeLower   : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper   : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ BorrowerAPR             : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration: int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ ListingCategory         : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ LoanStatus              : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...

3.1 Loan Original Amount

The most important property of a loan is the original amount.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The minimum is $1,000.00, the maximum is $35,000.00, and the median is $6,500.00.

Most of the loan original amount values are between $1,000.00 and $10,000.00.

There are some values that are more frequent than the others. To find out which values, I am going to list the most frequent ones:

##    4000   15000   10000    5000    2000    3000   25000   20000    1000 
##   14333   12407   11106    6990    6067    5749    3630    3291    3206 
##    2500 (Other) 
##    2992   44166

Based on the list of the ten most frequent values, borrowers tend to choose round amounts, and half of the ten are multiples of $5,000.
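A list of the most frequent values can be produced with a sorted frequency table. The following is a minimal sketch on a hypothetical sample of amounts; it is not necessarily the call that produced the output above.

```r
# Hypothetical sample of loan amounts (assumed values, not the real data).
amounts <- c(4000, 4000, 4000, 15000, 15000, 10000, 2500)

# Count each distinct amount and sort the counts from most to least frequent.
amount_counts <- sort(table(amounts), decreasing = TRUE)

# Keep the two most frequent amounts.
top_amounts <- amount_counts[1:2]
print(top_amounts)
```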

Setting the binwidth to 5000 shows a decreasing trend.

3.2 Monthly Loan Payment

The second most important property of a loan is the monthly payment.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

The minimum is $0.00, the maximum is $2,252.00, and the median is $217.70.

The majority of the monthly payments are between $0.00 and $500.00.

There are some outliers.

I am going to limit the monthly payment to the 0.99 quantile (the 99th percentile) and set the binwidth to 10.
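Trimming at the 0.99 quantile and binning in steps of 10 might look like this in base R. The `payments` vector is a hypothetical example, and base `hist()` is used here only to keep the sketch free of extra packages.

```r
# Hypothetical monthly payments with one extreme value (assumed values).
payments <- c(rep(200, 99), 2252)

# Value below which 99% of the payments fall.
cutoff <- quantile(payments, probs = 0.99)

# Keep only payments at or below the cutoff before plotting.
trimmed <- payments[payments <= cutoff]

# Histogram breaks 10 wide, covering the trimmed range.
breaks <- seq(0, max(trimmed) + 10, by = 10)
h <- hist(trimmed, breaks = breaks, plot = FALSE)
print(sum(h$counts))
```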

Based on this plot, a lot of loans have a monthly payment of $0.00.

Count 0 values:

## 
##  FALSE   TRUE 
## 113002    935

935 loans have a monthly payment of $0.00. Let’s see the status of those loans.

## 
##              Cancelled             Chargedoff              Completed 
##                      0                      0                    800 
##                Current              Defaulted FinalPaymentInProgress 
##                      0                    131                      4 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                      0                      0                      0 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                      0                      0                      0

All of those loans are completed, defaulted or the final payment is in progress.
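The two tables above (zero versus non-zero payments, then loan status within the zero-payment group) can be sketched as follows. The five-row `loans` data frame is hypothetical.

```r
# Hypothetical five-loan sample (assumed values).
loans <- data.frame(
  MonthlyLoanPayment = c(0, 330.43, 0, 123.32, 0),
  LoanStatus = factor(c("Completed", "Current", "Defaulted",
                        "Completed", "Completed"))
)

# Logical flag: TRUE where the monthly payment is exactly zero.
is_zero <- loans$MonthlyLoanPayment == 0
zero_counts <- table(is_zero)

# Status breakdown restricted to the zero-payment loans.
# Unused factor levels are kept, so absent statuses show a count of 0.
zero_status <- table(loans$LoanStatus[is_zero])
print(zero_counts)
print(zero_status)
```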

3.3 Term

Term is also a critical variable, as it is the one most likely to have a strong relationship with the amount or the monthly payment.

Number of loans per term:

## 
##    12    36    60 
##  1614 87778 24545

Proportion of loans per term:

## 
##   12   36   60 
## 0.01 0.77 0.22

77% of the loans last 36 months (3 years), 22% last 60 months (5 years) and 1% last 12 months (1 year).
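The proportions follow directly from the counts. A short check, using the three counts printed above (the variable names are illustrative):

```r
# Loan counts per term, copied from the table above.
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Share of each term among all loans, rounded to two decimals.
term_share <- round(term_counts / sum(term_counts), 2)
print(term_share)
```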

3.4 Loan Origination Date

I am interested in how the different variables changed over time. First, let’s see how the number of loans changed.

There is an increasing trend in the number of loans from 2006 to late 2008 and from late 2009 to 2014, and there is a gap between late 2008 and late 2009.

Enlarging that timeframe shows that almost no loans were registered over a period of approximately 10 months. The most likely reason for this anomaly is the subprime mortgage crisis.
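One way to quantify such a gap is to count loans per calendar month and list the months with no loans. This is a sketch on hypothetical dates; the real gap should be read from the actual data.

```r
# Hypothetical origination dates (assumed values).
dates <- as.Date(c("2008-09-15", "2008-10-02", "2009-07-20",
                   "2009-08-03", "2009-08-21"))

# Label each loan with its year and month, then count loans per month.
month_label <- format(dates, "%Y-%m")
loans_per_month <- table(month_label)

# Months in the inspected window that have no loans at all.
all_months <- format(seq(as.Date("2008-09-01"), as.Date("2009-08-01"),
                         by = "month"), "%Y-%m")
empty_months <- setdiff(all_months, names(loans_per_month))
print(empty_months)
```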

3.5 Stated Monthly Income

I would like to know the monthly income of the borrowers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

The minimum is $0.00, the maximum is $1,750,000.00, and the median is $4,667.00. The difference between the median and the maximum is significant.

The histogram of monthly incomes shows a serious outlier issue. This confirms my suspicion about the big difference between the median ($4,667.00) and the maximum ($1,750,000.00).

Limiting the values to the 0.99 quantile and setting the binwidth to 150 shows a much clearer, positively skewed, roughly bell-shaped distribution.

The only thing that does not seem obvious is the high frequency of the 0 values.

Number of 0 values:

## 
##  FALSE   TRUE 
## 112543   1394

There are 1394 loans in the dataset with 0 stated monthly income.

## 
##      Not Available Debt Consolidation   Home Improvement 
##                238                358                 47 
##           Business      Personal Loan        Student Use 
##                249                121                 67 
##               Auto              Other      Baby&Adoption 
##                 28                183                  1 
##               Boat Cosmetic Procedure    Engagement Ring 
##                  1                  1                  4 
##        Green Loans Household Expenses    Large Purchases 
##                  2                 56                  6 
##     Medical/Dental         Motorcycle                 RV 
##                 17                  0                  0 
##              Taxes           Vacation      Wedding Loans 
##                  4                  8                  3

Listing category does not explain the zero values.

## 
##              Cancelled             Chargedoff              Completed 
##                      1                    347                    677 
##                Current              Defaulted FinalPaymentInProgress 
##                    259                     80                      0 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                      1                     11                      5 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                      3                      7                      3

Most of these loans are completed, charged off or defaulted, but still 259 of them are current.

3.6 Debt to Income Ratio

Let’s have a look at how much of their income borrowers have to spend on paying back debts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The minimum is 0.000, the maximum is 10.010, and the median is 0.220.

There are some outliers.

Most of the debt to income ratio values are between 0.14 and 0.32.

The histogram shows a positively skewed, roughly bell-shaped distribution.

3.7 Employment Status

I would like to know if there is any trend in the employment status of the borrowers, and later whether it has any relationship to the original amount or monthly payment of the loans.

Number of loans per employment status:

## 
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

Proportion of loans per employment status:

## 
##                    Employed     Full-time Not available  Not employed 
##          0.02          0.59          0.23          0.05          0.01 
##         Other     Part-time       Retired Self-employed 
##          0.03          0.01          0.01          0.05

89% of the borrowers are employed or retired (Employed, Full-time, Part-time, Retired, Self-employed), fewer than 1% are not employed, and there is no useful employment information for about 10% of them (blank, Not available, Other).
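The grouped percentages can be checked from the counts in the table above. In this sketch the empty factor level is labelled "(blank)" for readability, which is a naming assumption; the group names are also illustrative.

```r
# Loan counts per employment status, copied from the table above.
# The empty level is named "(blank)" here for readability (an assumption).
status_counts <- c("(blank)" = 2255, "Employed" = 67322, "Full-time" = 26355,
                   "Not available" = 5347, "Not employed" = 835,
                   "Other" = 3806, "Part-time" = 1088, "Retired" = 795,
                   "Self-employed" = 6134)

working_or_retired <- c("Employed", "Full-time", "Part-time", "Retired",
                        "Self-employed")
no_information <- c("(blank)", "Not available", "Other")

total <- sum(status_counts)
group_share <- c(
  working_or_retired = sum(status_counts[working_or_retired]) / total,
  not_employed = status_counts[["Not employed"]] / total,
  no_information = sum(status_counts[no_information]) / total
)
# Percentages with one decimal place.
print(round(100 * group_share, 1))
```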

3.8 Employment Status Duration

Let’s see the length of the employment statuses.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

The minimum is 0.00, the maximum is 755.00, and the median is 67.00.

Most of the employment status durations are between 0 and 200 months, and the distribution shows a decreasing trend.

3.9 Credit Score

Credit score represents the creditworthiness of a borrower, and it is used by banks and other lenders to evaluate the risk of a loan. Therefore, I would like to see how it correlates with the properties of peer-to-peer loans.

The dataset contains both credit score range lower and credit score range upper variables. I am going to check whether there is any difference between the two, or whether I can omit one of them in my analysis.

Summary of (credit score range upper - credit score range lower) values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      19      19      19      19      19      19     591

Number of (credit score range upper - credit score range lower) values:

## credit_score_range_difference
##     19 
## 113346

The difference of credit score range upper and credit score range lower is always 19. Therefore, I can omit one of those values. I am going to use credit score range lower in my analysis.
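The check can be reproduced on the first ten rows printed in the structure listing at the start of section 3. This is only a sketch on that small excerpt; the full-data result is the table above.

```r
# First ten values of each column, as printed in the structure listing.
lower <- c(640, 680, 480, 800, 680, 740, 680, 700, 820, 820)
upper <- c(659, 699, 499, 819, 699, 759, 699, 719, 839, 839)

# Width of each credit score range and how often each width occurs.
credit_score_range_difference <- upper - lower
difference_counts <- table(credit_score_range_difference)
print(difference_counts)
```

Because upper is always lower plus a constant, the two variables are perfectly linearly related and carry the same information for correlation purposes.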

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

The minimum is 0.0, the maximum is 880.0, and the median is 680.0.

Most of the credit score range lower values are between 600 and 800.

All values are multiples of 10.

The distribution seems almost normal.

There are some outliers.

3.10 Borrower Rate

Borrower rate is the borrower’s interest rate for a loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The minimum is 0.0000, the maximum is 0.4975, and the median is 0.1840.

Most frequent values:

## Source: local data frame [2,294 x 2]
## 
##    BorrowerRate count
## 1        0.3177  3672
## 2        0.3500  1905
## 3        0.3199  1651
## 4        0.2900  1508
## 5        0.2699  1319
## 6        0.1500  1182
## 7        0.1400  1035
## 8        0.1099   949
## 9        0.2000   907
## 10       0.1585   806
## ..          ...   ...

Most borrower rate values are between 0.1 and 0.3.

The values at 0.32 and 0.35 are unexpectedly frequent.

There are some outliers.

3.11 Borrower APR

Borrower APR is the borrower’s annual percentage rate for a loan. It represents the total yearly cost of a loan, and it includes the borrower rate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

The minimum is 0.00653, the maximum is 0.51230, and the median is 0.20980.

Most frequent values:

## Source: local data frame [6,678 x 2]
## 
##    BorrowerAPR count
## 1      0.35797  3672
## 2      0.35643  1644
## 3      0.37453  1260
## 4      0.30532   902
## 5      0.29510   747
## 6      0.35356   721
## 7      0.29776   707
## 8      0.15833   652
## 9      0.24246   605
## 10     0.24758   601
## ..         ...   ...

Most borrower APR values are between 0.1 and 0.3.

The values around 0.36 are unexpectedly frequent.

There are some outliers.

Based on the distributions, there is a clear relationship between borrower rate and borrower APR. As borrower APR includes borrower rate, this is not a surprise.

3.12 Listing Category

I would like to see what the borrowers used their loans for.

## 
##      Not Available Debt Consolidation   Home Improvement 
##              16965              58308               7433 
##           Business      Personal Loan        Student Use 
##               7189               2395                756 
##               Auto              Other      Baby&Adoption 
##               2572              10494                199 
##               Boat Cosmetic Procedure    Engagement Ring 
##                 85                 91                217 
##        Green Loans Household Expenses    Large Purchases 
##                 59               1996                876 
##     Medical/Dental         Motorcycle                 RV 
##               1522                304                 52 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

I have used a square-root transformation to make the histogram more readable.

Most loans have a listing category of debt consolidation.

A lot of loans do not have a useful listing category (Not Available, Other).

3.13 Loan Status

What is the current status of the loans in the dataset?

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

I have used a square-root transformation to make the histogram more readable.

Most of the loans have a loan status of current, completed, charged off or defaulted. All the other statuses have a very low occurrence in the dataset.

4 Bivariate Analysis

##                          LoanOriginalAmount MonthlyLoanPayment        Term
## LoanOriginalAmount               1.00000000         0.93198368  0.33892746
## MonthlyLoanPayment               0.93198368         1.00000000  0.09102578
## Term                             0.33892746         0.09102578  1.00000000
## StatedMonthlyIncome              0.20125947         0.19683026  0.02847925
## DebtToIncomeRatio                0.01011222         0.02759840 -0.01467005
## EmploymentStatusDuration         0.09814935         0.08116016  0.08247591
## CreditScoreRangeLower            0.34087445         0.29253205  0.12626345
## CreditScoreRangeUpper            0.34087445         0.29253205  0.12626345
## BorrowerRate                    -0.32895995        -0.24474235  0.02008537
## BorrowerAPR                     -0.32288669        -0.22665287 -0.01118347
##                          StatedMonthlyIncome DebtToIncomeRatio
## LoanOriginalAmount                0.20125947        0.01011222
## MonthlyLoanPayment                0.19683026        0.02759840
## Term                              0.02847925       -0.01467005
## StatedMonthlyIncome               1.00000000       -0.12265939
## DebtToIncomeRatio                -0.12265939        1.00000000
## EmploymentStatusDuration          0.06983037       -0.01160926
## CreditScoreRangeLower             0.10790082       -0.01316852
## CreditScoreRangeUpper             0.10790082       -0.01316852
## BorrowerRate                     -0.08898180        0.06291678
## BorrowerAPR                      -0.08233849        0.05632742
##                          EmploymentStatusDuration CreditScoreRangeLower
## LoanOriginalAmount                    0.098149347            0.34087445
## MonthlyLoanPayment                    0.081160161            0.29253205
## Term                                  0.082475906            0.12626345
## StatedMonthlyIncome                   0.069830374            0.10790082
## DebtToIncomeRatio                    -0.011609265           -0.01316852
## EmploymentStatusDuration              1.000000000            0.08113411
## CreditScoreRangeLower                 0.081134109            1.00000000
## CreditScoreRangeUpper                 0.081134109            1.00000000
## BorrowerRate                         -0.019907440           -0.46156668
## BorrowerAPR                          -0.008588601           -0.42970732
##                          CreditScoreRangeUpper BorrowerRate  BorrowerAPR
## LoanOriginalAmount                  0.34087445  -0.32895995 -0.322886690
## MonthlyLoanPayment                  0.29253205  -0.24474235 -0.226652867
## Term                                0.12626345   0.02008537 -0.011183469
## StatedMonthlyIncome                 0.10790082  -0.08898180 -0.082338491
## DebtToIncomeRatio                  -0.01316852   0.06291678  0.056327417
## EmploymentStatusDuration            0.08113411  -0.01990744 -0.008588601
## CreditScoreRangeLower               1.00000000  -0.46156668 -0.429707322
## CreditScoreRangeUpper               1.00000000  -0.46156668 -0.429707322
## BorrowerRate                       -0.46156668   1.00000000  0.989823970
## BorrowerAPR                        -0.42970732   0.98982397  1.000000000

Original amount has a strong correlation with monthly payment (0.93198368), and a weak correlation with term (0.33892746), credit score (0.34087445), borrower rate (-0.32895995) and borrower APR (-0.32288669).

Monthly payment has a weak correlation with credit score (0.29253205), borrower rate (-0.24474235) and borrower APR (-0.22665287).

Credit score has a weak correlation with borrower rate (-0.46156668) and borrower APR (-0.42970732).

There is a very strong correlation between borrower rate and borrower APR (0.989823970). As previously mentioned, borrower APR includes borrower rate, so this correlation is expected.
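A correlation matrix like the one above can be computed with `cor()`. Several columns contain missing values, so some rule for handling them is needed; `use = "pairwise.complete.obs"` in this sketch is an assumption, and the small `loans_num` data frame is hypothetical.

```r
# Hypothetical numeric columns with one missing value (assumed values).
loans_num <- data.frame(
  LoanOriginalAmount    = c(9425, 10000, 3001, 10000, 15000, 15000),
  Term                  = c(36, 36, 36, 36, 36, 60),
  CreditScoreRangeLower = c(640, 680, NA, 800, 680, 740)
)

# Pearson correlations; each pair uses the rows where both values are present.
cor_matrix <- cor(loans_num, use = "pairwise.complete.obs")
print(round(cor_matrix, 3))
```

With pairwise deletion, different cells of the matrix can be based on different numbers of rows, which is worth keeping in mind when comparing coefficients.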

In this section I am going to analyse the relationships above, the changes in the different features over time, and the relationships of the original amount and monthly payment with listing category and employment status. I would also like to see how employment status duration differs between employment status values, to get a better picture of the borrowers.

4.1 Loan Original Amount and Monthly Payment

## 
##  Pearson's product-moment correlation
## 
## data:  loans$MonthlyLoanPayment and loans$LoanOriginalAmount
## t = 867.8179, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

Original amount and monthly payment have a very strong (0.9319837) positive correlation. Based on the scatterplot, there are three strong linear relationships between them. That means that monthly payment alone cannot describe the variation of original amount; there must be something else participating in it. Later in my analysis I am going to try to find that other participant.
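The t value reported in the test output above is a function of the correlation r and the degrees of freedom: t = r * sqrt(df / (1 - r^2)). The helper name below is hypothetical; the inputs are the rounded values from the output, so the result agrees only up to rounding.

```r
# Hypothetical helper: t statistic of a Pearson correlation r with df
# degrees of freedom (df = number of complete pairs minus 2).
t_from_r <- function(r, df) r * sqrt(df / (1 - r^2))

# Values taken from the test output above.
t_stat <- t_from_r(r = 0.9319837, df = 113935)
print(round(t_stat, 1))
```

With this many observations even a small correlation yields a tiny p-value, so the size of the coefficient is more informative than its significance.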

4.2 Loan Original Amount and Term

## 
##  Pearson's product-moment correlation
## 
## data:  loans$Term and loans$LoanOriginalAmount
## t = 121.5996, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3337778 0.3440569
## sample estimates:
##       cor 
## 0.3389275

There is a weak correlation between original amount and term (0.3389275). The only visible difference is that the values under $10,000.00 are less frequent if the term is 60 months. That means that longer loans do not necessarily have a higher original amount, as I had expected.

4.3 Loan Original Amount and Credit Score

## 
##  Pearson's product-moment correlation
## 
## data:  loans$CreditScoreRangeLower and loans$LoanOriginalAmount
## t = 122.0719, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3357190 0.3460095
## sample estimates:
##       cor 
## 0.3408745

There is a weak correlation between original amount and credit score (0.3408745). The scatterplot does not provide any extra information.

4.4 Loan Original Amount and Borrower Rate

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerRate and loans$LoanOriginalAmount
## t = -117.5822, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

There is a weak correlation between original amount and borrower rate (-0.3289599). The scatterplot does not provide any extra information.

4.5 Loan Original Amount and Borrower APR

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerAPR and loans$LoanOriginalAmount
## t = -115.1434, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3280787 -0.3176752
## sample estimates:
##        cor 
## -0.3228867

There is a weak correlation between original amount and borrower APR (-0.3228867). The scatterplot does not provide any extra information.

4.6 Monthly Payment and Credit Score

## 
##  Pearson's product-moment correlation
## 
## data:  loans$CreditScoreRangeLower and loans$MonthlyLoanPayment
## t = 102.9909, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2871995 0.2978465
## sample estimates:
##      cor 
## 0.292532

There is a weak correlation between monthly payment and credit score (0.292532). The scatterplot does not provide any extra information.

4.7 Monthly Payment and Borrower Rate

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerRate and loans$MonthlyLoanPayment
## t = -85.2021, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2501933 -0.2392759
## sample estimates:
##        cor 
## -0.2447424

There is a weak correlation between monthly payment and borrower rate (-0.2447424). The scatterplot shows multiple underlying relationships between them. Later in my analysis, I would like to look into this deeper.

4.8 Monthly Payment and Borrower APR

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerAPR and loans$MonthlyLoanPayment
## t = -78.5406, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2321545 -0.2211368
## sample estimates:
##        cor 
## -0.2266529

There is a weak correlation between monthly payment and borrower APR (-0.2266529). The scatterplot shows very similar relationships to the previous (monthly payment and borrower rate), with a little bit more noise.

4.9 Credit Score and Borrower Rate

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerRate and loans$CreditScoreRangeLower
## t = -175.1695, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

There is a weak correlation between credit score and borrower rate (-0.4615667). The scatterplot does not provide any extra information.

4.10 Credit Score and Borrower APR

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerAPR and loans$CreditScoreRangeLower
## t = -160.2137, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4344422 -0.4249487
## sample estimates:
##        cor 
## -0.4297073

There is a weak correlation between credit score and borrower APR (-0.4297073). The scatterplot does not provide any extra information.

4.11 Borrower Rate and Borrower APR

## 
##  Pearson's product-moment correlation
## 
## data:  loans$BorrowerRate and loans$BorrowerAPR
## t = 2347.699, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824

There is a very strong correlation between borrower rate and borrower APR (0.989824), and the scatterplot supports that. As previously described, this is because borrower APR includes borrower rate. The scatterplot shows multiple relationships. I would like to know what makes them different.

4.12 Loan Original Amount and Loan Origination Date

Now let’s see how different variables changed over time.

There was no significant change in the original amount before the gap in late 2008, nor between the end of the gap and 2011. In 2011, the smaller loans disappeared, and in 2013 the popularity of larger loans increased.

4.13 Monthly Loan Payment and Loan Origination Date

The pattern of changes in monthly payment over time is very similar to the pattern of changes in original amount, but with much more noise.

4.14 Term and Loan Origination Date

I have used a square-root transformation to make the frequency polygon more readable.

There is a strong relationship between origination date and term. All the loans before the previously mentioned gap (between late 2008 and late 2009) have a 36-month term. 12-month loans only occurred during a 2-year period between 2011 and 2013 and were not popular. 60-month loans started to occur at the same time as 12-month loans (in 2011), and their number has been increasing since then.

4.15 Stated Monthly Income and Loan Origination Date

There was no significant change in stated monthly income over time.

4.16 Debt to Income Ratio and Loan Origination Date

There was no significant change in debt to income ratio over time.

4.17 Employment Status and Loan Origination Date

There is no useful employment status information before late 2006. The Employed category almost replaced the Full-time category in late 2010; this is probably due to a change in the way the data was recorded.

4.18 Employment Status Duration and Loan Origination Date

The black line represents the smoothed conditional mean of employment status duration over time.

The employment status durations show an increasing trend over time.

4.19 Credit Score and Loan Origination Date

In 2006, all credit score values became greater than 500, and in late 2013, greater than 630. There was no significant change in the frequency of the values greater than 800.

4.20 Borrower Rate and Loan Origination Date

There was a significant drop in the maximum borrower rate at the beginning of 2006 (from 0.36 to 0.24). The drop was followed by increases in early 2006 (from 0.24 to 0.29) and in late 2007 (from 0.19 to 0.36). There were not many changes between late 2009 and 2011, when the maximum decreased again (to 0.32) and the data became much noisier. Between 2012 and 2013 the data became less noisy again, and in the second half of 2013 the values between 0.1 and 0.2 became significantly more frequent.

4.21 Borrower APR and Loan Origination Date

Once again, the scatterplot of borrower APR and origination date looks almost the same as the scatterplot of borrower rate and origination date, but with more noise.

4.22 Listing Category and Loan Origination Date

I have used a square-root transformation to make the frequency polygon more readable.

There is no useful listing category information before 2008. New listing categories were introduced in 2008 and 2012 as well; this indicates a change in the way the data was recorded, similar to employment status.

There is no visible pattern within listing categories.

4.23 Loan Status and Loan Origination Date

I have used a square-root transformation to make the frequency polygon more readable.

Almost all the loans that are current were originated after 2011. Most of the completed loans were originated before late 2008, or after late 2009 and before 2013. The majority of the charged-off loans were originated before the previously mentioned gap.

4.24 Loan Original Amount and Listing Category

Debt Consolidation, Baby&Adoption and Business loans have the highest original amounts; Motorcycle, Vacation and Household Expenses loans have the lowest.

4.25 Monthly Loan Payment and Listing Category

Loans for Debt Consolidation have the highest monthly payment, followed by Not Available, Home Improvement, Business and Other.

4.26 Loan Original Amount and Employment Status

Employed borrowers have the loans with the highest original amounts, and part-time employed borrowers have those with the lowest.

4.27 Monthly Loan Payment and Employment Status

Employed borrowers have the loans with the highest monthly payments, and part-time employed borrowers have those with the lowest.

4.28 Employment Status Duration and Employment Status

There is a relationship between the employment status and the employment status duration. Employed, Full-time, Retired and Self-employed statuses tend to have a higher duration than Not employed or Part-time.

5 Multivariate Analysis

In this section I am going to try to answer the questions I raised in the previous section.

5.1 Loan Original Amount and Monthly Payment

There are three linear relationships between original amount and monthly payment. The overall correlation between term and monthly payment is very weak (0.09102578), but term takes exactly three values (12, 36 and 60 months), which matches the number of lines, so term seems the best next variable to investigate.

## 
##  Pearson's product-moment correlation
## 
## data:  MonthlyLoanPayment and LoanOriginalAmount/Term
## t = 1386.25, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9712849 0.9719350
## sample estimates:
##       cor 
## 0.9716118
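
As a quick sanity check, the t statistic in the output above can be reproduced from the correlation estimate and the degrees of freedom with the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The sketch below is written in Python purely for illustration; `t_from_r` is a hypothetical helper name, and the inputs are the rounded values copied from the output.

```python
import math

def t_from_r(r, df):
    """t statistic for testing a Pearson correlation r against zero."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

# Rounded values copied from the correlation test output.
t = t_from_r(0.9716118, 113935)
print(round(t, 2))
```

The result should land very close to the reported 1386.25; small differences would come from the rounding of the correlation estimate.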

Monthly loan payment is almost equal to the original amount divided by the term of the loan. Therefore monthly loan payment can be largely described by original amount and term together.
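
The "almost" is consistent with how a fixed-rate instalment loan is usually amortised: the payment is the principal spread over the term, plus an interest component. The sketch below assumes the standard annuity formula; whether Prosper computes payments exactly this way is an assumption, not something the dataset states. `monthly_payment` is a hypothetical helper, written in Python for illustration, and the loan values are made up.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Fixed-rate amortised payment (standard annuity formula).

    Falls back to principal / term when the rate is zero.
    """
    r = annual_rate / 12.0  # monthly interest rate
    if r == 0:
        return principal / term_months
    return principal * r / (1.0 - (1.0 + r) ** -term_months)

# Illustrative values only, not taken from the dataset.
amount, term = 10000, 36
for rate in (0.0, 0.10, 0.20, 0.30):
    payment = monthly_payment(amount, rate, term)
    print(rate, round(payment, 2), round(amount / term, 2))
```

Under this assumption the payment equals amount divided by term only at a zero rate and rises above it as the rate grows, which would explain a correlation that is very high but not exactly 1.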

5.2 Monthly Payment and Borrower Rate

There are multiple linear relationships between monthly payment and borrower rate.

Adding term to the plot may help again.

Faceting by term shows some difference between the previously seen relationships.

As both term and original amount are related to monthly loan payment, adding original amount to the previous plot may make the picture clearer.

Faceting by term and colouring by original amount shows a complex relationship, where borrower rate, original amount and term all contribute to the monthly payment.

5.3 Borrower Rate and Borrower APR

There are multiple linear relationships between borrower rate and borrower APR.

Let’s try faceting by term again.

Faceting by term cleans the picture a little bit, but there are still multiple relationships within the cells.

As credit score has the strongest correlation with borrower APR (-0.429707322), it may explain some of the variation.

Colouring by credit score does not explain anything. Let’s try original amount, as it has the second strongest correlation with borrower APR (-0.322886690).

Original amount does not help either.

6 Final Plots

6.1 First Plot

I have chosen this chart because it shows how the three most important features correlate, and how two of them can describe the third one.

There is a very strong relationship between the term, the original amount and the monthly payment of a loan. The latter can be closely approximated from the former two. The monthly payment of a loan increases as the term decreases (short-term loans have higher monthly payments than long-term loans) and increases as the original amount increases.

6.2 Second Plot

The chart above shows how borrower rate affects the monthly payment, even though the overall correlation between them is weak (-0.24474235). It shows the importance of good visualisation, which can reveal underlying relationships between different features.

There is a positive, linear relationship between borrower rate and monthly payment once term and original amount are taken into account: loans that have a higher borrower rate also have a higher monthly payment. This relationship holds regardless of the term or the original amount of a loan. That means borrower rate also affects the monthly payment, together with the term and the original amount.
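
One possible explanation for a negative pooled correlation alongside a positive relationship within each panel is that larger loans tend to carry lower rates (original amount and borrower APR are negatively correlated, -0.322886690). This is a hypothesis rather than a finding. The Python sketch below illustrates the effect on synthetic data only: the amounts and rates are made up, the payment uses the standard annuity formula as an assumption, and `monthly_payment` and `pearson` are hypothetical helpers.

```python
import math

def monthly_payment(principal, annual_rate, term_months):
    """Fixed-rate amortised payment (assumed annuity formula, rate > 0)."""
    r = annual_rate / 12.0
    return principal * r / (1.0 - (1.0 + r) ** -term_months)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic 36-month loans: larger amounts are given lower rates.
groups = {
    2000: [0.25, 0.30, 0.35],
    5000: [0.20, 0.25, 0.30],
    10000: [0.12, 0.17, 0.22],
    20000: [0.08, 0.12, 0.16],
}

within = {}
all_rates, all_payments = [], []
for amount, rates in groups.items():
    payments = [monthly_payment(amount, rate, 36) for rate in rates]
    within[amount] = pearson(rates, payments)  # correlation inside one amount group
    all_rates += rates
    all_payments += payments

pooled = pearson(all_rates, all_payments)  # correlation ignoring the groups
print("pooled:", round(pooled, 2))
for amount, r in within.items():
    print(amount, round(r, 2))
```

In this toy setup the pooled correlation is negative while every within-group correlation is positive, which is the pattern that faceting by term and colouring by original amount makes visible.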

6.3 Third Plot

I have chosen this chart because it shows how the range of monthly loan payment varies between different listing categories. It shows that the purpose of a loan is also important, just like its amount and term.

The spread of monthly loan payment differs across listing categories. Some categories have a wider monthly payment range (e.g., Debt Consolidation), others have a narrower one (e.g., Motorcycle).

7 Reflection

The Prosper loan dataset contains 113937 observations of 83 variables from 2006 to 2014. At the beginning of my investigation I was trying to understand the meaning of the different variables, find issues with how they are stored, and work out how I could make the best use of them. As the dataset contains a lot of variables, it was challenging to decide where to start. Finally, I chose to start by investigating the variables that best represent a loan: original amount, term and monthly payment, and everything else that can have a relationship with them.

First I plotted each variable one by one to get an initial view of what the dataset contains and how good its quality is. I was very surprised when I saw that there were loans with 0 monthly payment, but later I realised that those loans are not current anymore. It was also unexpected that there were no loans in the dataset between late 2008 and late 2009, but after some research I understood that the reason for the gap is the subprime mortgage crisis.

Later, during the analysis of the relationships between the different variables, I realised that the monthly payment is the most important property of a loan, instead of the original amount as I had thought before. I found that the term and the original amount of a loan are the most relevant parameters for calculating the monthly payment. Borrower rate has an important role as well, however it was quite hard to visualise that relationship. I also found that both the monthly payment and its variation differ across listing categories.

Analysing the changes of the variables over time showed me that there were a few changes in the way the data was collected. That made it hard to use some categorical variables, like employment status. Standardising those categorical variables would allow a future investigation to make better use of them.

To get a better picture of the borrowers and their choices, I would analyse the dataset without the loans that are for debt consolidation. Investigating the loans that are paid off by debt consolidation would be useful as well. To improve this analysis, I would look deeper into the reason for the 0 stated monthly income values. I would also investigate the unexpectedly frequent values of borrower rate and borrower APR (0.32, 0.35), and I would also be interested in building a linear model to predict the monthly payment and the original amount.